Abstract. We present a new method to accelerate the HITS algorithm
by exploiting the hyperlink structure of the web graph. The proposed
algorithm extends the idea of authority and hub scores from HITS by
introducing two diagonal matrices which contain constants that act as
weights to make authority pages more authoritative and hub pages more
hubby. This method works because in the web graph good authorities
are pointed to by good hubs and good hubs point to good authorities.
Consequently, these pages will collect their scores faster under the
proposed algorithm than under the standard HITS. We show that the
authority and hub vectors of the proposed algorithm exist but are not
necessarily unique, and then give a treatment to ensure the uniqueness
of the vectors. The experimental results show that the proposed
algorithm can improve HITS computations, especially for back button
datasets.

Keywords: acceleration method, HITS, hyperlink struc- 
ture, power method, web graph 

1. Introduction 

HITS (Hypertext Induced Topic Search) is a ranking algorithm
introduced by Jon Kleinberg in 1998 [1] that utilizes the web graph's
hyperlink structure to create two metrics associated with every page.
The first metric, authority, determines a page's popularity, and the
second metric, hub, is used to find portal pages, i.e., pages that link
to popular (thus useful) pages. Because it is easy to create many
hyperlinks on a page to boost its hub score (and thereby increase the
authority scores of the pages it points to), HITS is susceptible to
the link spamming problem.

HITS is usually compared to PageRank [2], a popular ranking algorithm
used by Google that also uses the hyperlink structure to create a
popularity measure. Both algorithms were breakthrough achievements at
that time because, unlike previous methods which usually used a page's
contents, they take a different approach by utilizing the hyperlink
structure to measure a page's value. However, there are two main
differences that should be noted here. First, while HITS produces two
metrics, PageRank produces only one, the popularity measure. Second,
unlike PageRank, HITS is query-dependent; for every incoming query the
algorithm first finds relevant pages (usually by matching terms in the
query with the contents), builds a neighborhood graph, and then
calculates authority and hub scores for every page in the graph. The
neighborhood graph is built by taking as vertices not only the
relevant pages, but also other pages that either point to or are
pointed to by the relevant pages. This expansion step allows semantic
associations to be made and usually solves the synonym problem [1,3].
Unfortunately, it also creates a well-known problem associated with
HITS, topic drift: authoritative yet irrelevant pages are likely to be
included as well [1,3].
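
The expansion step described above can be sketched as follows. This is only a minimal illustration, assuming the link structure is held in a plain dictionary that maps each page to the set of pages it links to; the function name `build_neighborhood` and the toy page names are hypothetical and are not taken from [1].

```python
def build_neighborhood(links, root_set):
    """Expand a root set of relevant pages into a neighborhood graph.

    links: dict mapping a page to the set of pages it points to.
    root_set: pages judged relevant to the query.
    Returns the subgraph induced on the expanded set of pages.
    """
    root = set(root_set)
    nodes = set(root)
    for src, targets in links.items():
        if src in root:
            nodes |= set(targets)      # pages pointed to by a relevant page
        if root & set(targets):
            nodes.add(src)             # pages that point to a relevant page
    # Keep only edges whose endpoints are both inside the expanded set.
    return {p: {q for q in links.get(p, ()) if q in nodes} for p in nodes}


# Toy graph (made-up pages): only "b" matches the query.
web = {
    "a": {"b"},
    "b": {"c"},
    "c": set(),
    "d": {"b"},
    "e": {"f"},
    "f": set(),
}
graph = build_neighborhood(web, ["b"])
print(sorted(graph))
```

In this toy graph the pages "e" and "f" are left out because they neither point to nor are pointed to by the relevant page.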

The link spamming problem can be alleviated by giving only fractional
weights to edges from mutually reinforcing hosts [4]. The topic drift
can be mitigated by computing the relevancy between the query and the
pages in the neighborhood graph [4]; the more similar a page is to the
query, the more weight it receives, so the influence of less relevant
pages is reduced.

The query-dependence is considered to be the most problematic aspect
of HITS because the authority and hub vectors have to be calculated
online and in real time for every incoming query, thus consuming too
much computational, memory, and network resources. This problem can
be handled by modifying HITS to be query-independent: taking the
entire web graph as the neighborhood graph and calculating a global
authority vector and a global hub vector [3]. However, some crucial
problems faced by PageRank in its early development, like storage
issues, memory management, task division, parallelization strategies,
and computational methods, must be addressed before this task becomes
possible. Fortunately, query-independent HITS (QI-HITS) mathematically
resembles PageRank [29], so it can be expected that infrastructures
and methods built for PageRank can be adapted to QI-HITS.

Accelerating QI-HITS is not a trivial problem, but there are several
good reasons to put some effort into it. First, QI-HITS has some nice
properties: (1) like PageRank, it can be calculated offline, so the
system does not have to run it for every incoming query and some
resources can be saved; (2) unlike PageRank, it gives two measures,
authority scores for finding popular pages and hub scores for finding
portal pages; and (3) it completely solves the topic drift problem and
slightly reduces the link spamming problem. Second, as the web graph
is growing rapidly, the need for faster methods is inevitable, for
example to keep web indices fresh, to save resources, and to build
personalized and topic-sensitive schemes as in the PageRank case [5],
among others. Yet to the best of our knowledge, research on
accelerating HITS is hardly known, partly because HITS is originally
query-dependent; the sizes of the neighborhood graphs (generally about
1000-5000 pages in 1998 [1]) are much smaller than the size of the web
graph. Thus, faster but more memory-intensive methods based on matrix
inversion or decomposition, especially for sparse symmetric systems,
can be used [6]. And for QI-HITS (where the problem's scale is as
enormous as PageRank's), some techniques to accelerate PageRank
computations like Extrapolation methods [7], the BlockRank algorithm
[8], the Gauss-Seidel method [9], and Reordering methods [10, 11] can
be adopted because both algorithms involve dense vector × sparse
matrix operations [29].

In this paper, we propose a different approach to accelerate the HITS
algorithm. Unlike other methods where the acceleration is gained by
using techniques borrowed from linear algebra (Extrapolation and
Gauss-Seidel), exploiting adjacency matrix sparseness (Reordering), or
utilizing the nested block structure of the web graph (BlockRank),
this method makes use of the HITS definition itself. So, it can be
called a definition-based acceleration method.

The proposed algorithm introduces two diagonal matrices, $C_a$ and
$C_h$, which contain, for every page, constants associated with its
authority and hub scores respectively. These constants act as weights
that make authority pages more authoritative and hub pages more hubby.
Because in the web graph authoritative pages tend to be pointed to by
hubby pages, and hubby pages tend to point to authoritative pages, the
constants make these pages collect their scores faster, so the
stationary distributions (authority and hub vectors) can be reached in
fewer iteration steps. Because this approach makes use of the
authority and hub pages, it can be expected to perform better in a
graph with considerable proportions of authority and hub pages than in
a graph with uniform degree distributions. As shown in previous works
[12-14], the web graph does have power law distributions for both
indegrees and outdegrees, so authority and hub pages exist.

Note that we are concerned only with the performance gained when the
approach is applied in conjunction with the power method, because at
the scale of the web graph (1 trillion unique URLs in July 2008
according to a Google report) only matrix-free iterative methods like
the power method, Jacobi, or Gauss-Seidel are feasible. Out of these
methods, the power method is preferable because it is the simplest,
needs less memory, and scales linearly with the problem's size.
Further, some promising techniques like Reordering and BlockRank are
based on the power method.

2. Related Works 

Some methods to accelerate the PageRank computa- 
tions that can also be implemented in QI-HITS with some 
modifications are discussed in this section. 



Haveliwala [15] suggests using the ordering induced by the PageRank
vector rather than the residual as the stopping criterion, and shows
that on the 24-million-page Stanford WebBase archive dataset the
ordering induced by only the 10th iteration agrees fairly well with
the ordering induced by the 100th iteration in the query-specific
case. In the case of global ordering, only 25 iterations are needed.

Arasu et al. [9] propose using the Gauss-Seidel method instead of the
power method. This method updates the entries of the current iteration
vector as soon as they become available, and thus converges faster
than the power method. A nice property of this method is that its
formulation resembles the power method's, so it can easily be
implemented with small modifications to the system.

Kamvar et al. introduce two Extrapolation methods to accelerate the
PageRank computations [7]. The methods assume that the PageRank vector
can be written as a linear combination of the eigenvectors of the
Google matrix, a stochastic and primitive version of the adjacency
matrix induced from the web graph. Because the vector is the principal
eigenvector [30] of the Google matrix, the PageRank convergence can be
sped up by subtracting some subdominant eigenvectors from the current
iteration vector. The first method, Aitken Extrapolation, uses
successive intermediate vectors to estimate the second eigenvector,
and subtracts it from the current iteration vector. The second method,
Quadratic Extrapolation, estimates not only the second but also the
third eigenvector, and subtracts them from the current iteration
vector. It has been shown that Quadratic Extrapolation is better than
Aitken Extrapolation, not only because it subtracts more error, but
also because there are cases where the second and the third
eigenvectors are repeated, so significant improvement can only be
achieved by subtracting both vectors from the current iteration
vector.

In their next work, Kamvar et al. propose a very promising aggregation
method called BlockRank that speeds up the computation of PageRank by
a factor of two in realistic scenarios [8]. This method works well
because the web graph has a nested block structure: most links from
pages within a host point to other pages within the same host, and
only a few are interhost links. The method first calculates a local
PageRank vector for each host by ignoring the interhost links. Then a
global PageRank vector of the hostgraph, a graph created by taking the
hosts as the vertices and the interhost links as the edges, is
calculated. The global PageRank vector is then used to weight the
corresponding local PageRank vectors. The result is taken as the
starting vector for the standard PageRank algorithm. Because of this
locality approach, BlockRank lends itself to parallelization, and is
thus very suitable for implementation in practice.

Reordering [10, 11] is another very promising method to speed up the
PageRank calculations, both in the cost per iteration and the number
of iterations. This method never performs worse than the original
algorithm, and on some datasets the speedup can reach a factor of 5
[11]. This method works by exploiting dangling pages, pages with no
outlinks that usually
Table 1. Similarity measures between vectors.

                    Auth. - Indeg.        Hub - Outdeg.
Data                Cosine   Spearman     Cosine   Spearman
britannica.com      0.9776   0.9614       0.9558   0.9409
jobs.ac.uk          0.9981   0.7590       0.5326   0.9828
opera.com           0.9337   0.5377       0.5878   0.9904
python.org          0.8629   0.6212       0.1943   0.9722
scholarpedia.org    0.9999   0.9986       0.6339   0.9983
stanford.edu        0.8662   0.3794       0.5968   0.9635
en.wikipedia.org    0.9452   0.7323       0.8306   1.0000
yahoo.com           0.5654   0.4968       0.4158   0.9950
Average             0.8936   0.6858       0.5934   0.9804

make up over 80% of the webpages. In the adjacency matrix
representation the dangling pages are the zero rows, so they look
alike and can be lumped together into a teleportation state.
Consequently, the problem turns into solving PageRank for the
nondangling pages [10]. More recent work by Langville and Meyer [11]
provides linear algebra approaches to this problem. They suggest
reordering the adjacency matrix so that the rows corresponding to the
dangling pages are placed at the bottom of the matrix; the PageRank
vector is then computed only for the nondangling portion. The scores
of the dangling pages are recovered by using the vector of the
nondangling pages and forward substitution.



3. Proposed Algorithm 

3.1. Formulation 

We will first review the definition and mathematical model of HITS
before deriving the formulation of the proposed algorithm. HITS is
defined with the following statement: the authority score of a page is
the sum of the hub scores of the pages that point to it, and the hub
score of a page is the sum of the authority scores of the pages that
it points to [1]. This is a circular statement; the authority scores
depend on the hub scores and vice versa. To solve it, every page must
be given initial scores, and the final scores are computed by
successively repeating the summing processes with normalization until
a predefined criterion is satisfied. The following equation gives the
formulation of HITS:

$a_i^{(k+1)} = \sum_{j \in B_i} h_j^{(k)}$, and $h_i^{(k+1)} = \sum_{j \in F_i} a_j^{(k+1)}$ (1)

for $k = 0, 1, \ldots, K-1$, where $K$ denotes the final iteration at
which the predefined criterion is satisfied, $a_i^{(k)}$ and
$h_i^{(k)}$ denote the authority and the hub score of page $i$ at
iteration $k$, $B_i$ is the set of pages that point to $i$, and $F_i$
is the set of pages that are pointed to by $i$.
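
The summing process of eq. 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes the graph fits in memory as a dictionary mapping each page to the set of pages it points to, normalizes with the 1-norm, and stops when the change in the hub vector drops below a tolerance. The names `hits`, `tol`, and `max_iter` are hypothetical.

```python
def hits(links, tol=1e-10, max_iter=1000):
    """Plain HITS on a graph given as {page: set of pages it points to}."""
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    hub = {p: 1.0 / len(pages) for p in pages}   # uniform starting vector
    auth = dict.fromkeys(pages, 0.0)
    for _ in range(max_iter):
        # Authority of i: sum of the hub scores of the pages that point to i.
        auth = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub of i: sum of the fresh authority scores of the pages i points to.
        new_hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        total = sum(new_hub.values()) or 1.0      # guard for a graph with no links
        new_hub = {p: v / total for p, v in new_hub.items()}
        residual = sum(abs(new_hub[p] - hub[p]) for p in pages)
        hub = new_hub
        if residual < tol:
            break
    total = sum(auth.values()) or 1.0
    auth = {p: v / total for p, v in auth.items()}
    return auth, hub


# Toy graph (made-up pages): two portal pages and two target pages.
toy = {"h1": {"x", "y"}, "h2": {"x"}, "x": set(), "y": set()}
auth, hub = hits(toy)
print(auth)
print(hub)
```

In the toy graph, "x" is pointed to by both portal pages and therefore ends up with the larger authority score, while "h1", which points to both targets, ends up with the larger hub score.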

As shown in eq. 1, HITS simply calculates the author- 
ity (hub) score of a page by adding hub (authority) scores 
of other pages that point to (are pointed to by) it. This 
original formulation misses an underlying important as- 
pect of the preferential attachment in the web graph; the 
portal pages (pages with many outlinks) tend to point to 
Fig. 1. The distances between initial and final distributions.



the popular pages (pages with many inlinks), and the popular pages
tend to get many new inlinks. This preferential attachment is the main
reason behind the skewed distributions of indegrees and outdegrees as
reported in many experiments [12-14], and because the authority (hub)
score of a page is correlated to its indegree (outdegree) [16, 17], it
can be expected that the stationary distributions of the authority and
the hub scores are also skewed. We confirm this empirically by
calculating the similarities between authority and indegree vectors
and between hub and outdegree vectors. Table 1 shows the results,
where the cosine criterion measures the similarity between two vectors
and the Spearman correlation measures the similarity between the
orderings induced from the vectors (see section 4.1 for details about
the datasets).
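
The two measures can be sketched as below. This is a hedged illustration, not the code used to produce Table 1: the Spearman correlation is computed as the Pearson correlation of the rank vectors with tied entries given their average rank, which is an assumption since the text does not say how ties were handled. The function names and the sample vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two score vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks; tied entries share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(u, v):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ru, rv = ranks(u), ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    su = math.sqrt(sum((a - mu) ** 2 for a in ru))
    sv = math.sqrt(sum((b - mv) ** 2 for b in rv))
    return cov / (su * sv)


authority = [0.50, 0.30, 0.15, 0.05]   # made-up authority scores
indegree = [40, 12, 9, 1]              # made-up indegrees of the same pages
print(cosine_similarity(authority, indegree))
print(spearman(authority, indegree))
```

Because both made-up vectors order the four pages identically, the Spearman correlation is 1 even though the cosine value is below 1; the two measures capture different kinds of agreement.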

Usually, a uniform distribution is used as the starting vector. Thus,
the distances between initial and final scores are not uniform across
pages. For some very authoritative and hubby pages, it takes more
iteration steps to reach the final scores. This is also true for pages
that have very low final authority or hub scores. Fig. 1 illustrates
this condition: the distances between initial and final scores of
pages ranked at the top and at the bottom are greater than those of
pages in the middle positions.

The proposed algorithm is formulated to deal with this problem. It
estimates the distances between initial and final scores, and sets the
convergence velocities proportional to the distances. As stated
earlier, the final authority and hub scores can be roughly
approximated by the indegree and outdegree distributions, so these
distributions are used to create constants that determine the
convergence velocities.

Let $ca_i$ and $ch_i$ be the constants associated with the authority
and hub score of page $i$. We make some observations before writing
down the formulations. Clearly, $ca_i$ must be bigger than $ch_i$ if
$i$ has more inlinks than outlinks, $ch_i$ must be bigger than $ca_i$
if $i$ has more outlinks than inlinks, and $ca_i$ must be equal to
$ch_i$ if $i$ has the same number of inlinks and outlinks. Also, the
addition of a new link to a page with a small number of links should
have a greater impact than the addition to a highly connected page. By
using these observations, we define the constants as:

$ca_i = \frac{\mathrm{indeg}_i}{\mathrm{deg}_i} \, |\mathrm{indeg}_i - \mathrm{outdeg}_i|^{\sigma_i}$, and (2)

$ch_i = \frac{\mathrm{outdeg}_i}{\mathrm{deg}_i} \, |\mathrm{indeg}_i - \mathrm{outdeg}_i|^{-\sigma_i}$, where (3)

$\sigma_i = \begin{cases} 1 & \text{if } \mathrm{indeg}_i > \mathrm{outdeg}_i \\ -1 & \text{if } \mathrm{indeg}_i < \mathrm{outdeg}_i \\ 0 & \text{otherwise} \end{cases}$

where $\mathrm{indeg}_i$, $\mathrm{outdeg}_i$, and $\mathrm{deg}_i$
denote the indegree, outdegree, and degree of $i$ respectively, and
$|*|$ denotes the absolute value of *. The proposed algorithm is
defined with the following equation:

$a_i^{(k+1)} = \sum_{j \in B_i} h_j^{(k)} ch_j$, and $h_i^{(k+1)} = \sum_{j \in F_i} a_j^{(k+1)} ca_j$ (4)

where, as in eq. 1, $B_i$ is the set of pages that point to $i$ and
$F_i$ is the set of pages that are pointed to by $i$. As shown above,
$h_j$ and $a_j$ are weighted with $ch_j$ and $ca_j$ respectively.
Because $ca_j$ and $ch_j$ are proportional to $\mathrm{indeg}_j$ and
$\mathrm{outdeg}_j$, which in turn are likely to be proportional to
the distances between final and initial authority and hub vectors,
these constants tend to make the portal and popular pages, and also
the pages located in the bottom positions, collect their scores faster
as the iteration steps progress. Thus, it can be expected that the
proposed algorithm will converge faster than HITS.
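
A small sketch of how the two constants could be computed for one page is given below. It rests on one reading of eqs. 2 and 3, namely $ca_i = (\mathrm{indeg}_i/\mathrm{deg}_i)\,|\mathrm{indeg}_i - \mathrm{outdeg}_i|^{\sigma_i}$ and $ch_i = (\mathrm{outdeg}_i/\mathrm{deg}_i)\,|\mathrm{indeg}_i - \mathrm{outdeg}_i|^{-\sigma_i}$, with $\sigma_i$ equal to 1, -1, or 0 depending on whether the indegree is larger than, smaller than, or equal to the outdegree. It also assumes that the degree is the sum of indegree and outdegree and that isolated pages get zero weights; neither assumption is stated in the text, and the function name `weights` is hypothetical.

```python
def weights(indeg, outdeg):
    """Return (ca, ch) for one page under the reading of eqs. 2-3 stated above."""
    deg = indeg + outdeg               # assumption: degree = indegree + outdegree
    if deg == 0:
        return 0.0, 0.0                # assumption: isolated pages get zero weight
    if indeg > outdeg:
        sigma = 1
    elif indeg < outdeg:
        sigma = -1
    else:
        sigma = 0
    diff = abs(indeg - outdeg)         # diff >= 1 whenever sigma != 0
    ca = indeg / deg * diff ** sigma
    ch = outdeg / deg * diff ** (-sigma)
    return ca, ch


print(weights(3, 1))   # more inlinks than outlinks
print(weights(2, 2))   # balanced page
print(weights(1, 3))   # more outlinks than inlinks
print(weights(5, 0))   # dangling page
```

Under this reading the three observations in the text hold: the authority constant dominates for pages with more inlinks, the hub constant dominates for pages with more outlinks, and both are equal for balanced pages. A dangling page gets a hub constant of zero.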

The proposed algorithm will be represented in matrix notation for
several reasons: (1) to simplify the formulation, (2) to allow graph
properties to be seen from a linear algebra perspective, (3) to
compare its formulation to those of HITS and PageRank, (4) to allow
the other acceleration methods stated previously to be applied with
ease, and (5) to analyze the convergence property (see section 3.4).
Let $L$ be the adjacency matrix of the web graph where $L_{ij}$ is 1
if there is a link from $i$ to $j$, and 0 otherwise,
$C_a = \mathrm{diag}(ca_1, ca_2, \ldots, ca_N)$,
$C_h = \mathrm{diag}(ch_1, ch_2, \ldots, ch_N)$, and $N$ is the number
of pages in the web graph. Thus, the proposed algorithm can be
rewritten as:

$a^{(k+1)T} = h^{(k)T} C_h L$, and $h^{(k+1)T} = a^{(k+1)T} C_a L^T$ (5)

where $a^T$ is the $1 \times N$ authority vector and $h^T$ is the
$1 \times N$ hub vector.
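
The weighted iteration of eq. 5 can be sketched as follows. As before, this is an illustrative sketch and not the authors' implementation: the graph is assumed to fit in memory as a dictionary of outlink sets, the weights follow one reading of eqs. 2 and 3 (indegree or outdegree fraction times the absolute degree difference raised to +1, -1, or 0), the degree is taken as indegree plus outdegree, and the names `degree_weights` and `weighted_hits` are hypothetical.

```python
def degree_weights(links, pages):
    """Per-page constants ca and ch under the assumed reading of eqs. 2-3."""
    indeg = dict.fromkeys(pages, 0)
    for targets in links.values():
        for q in targets:
            indeg[q] += 1
    ca, ch = {}, {}
    for p in pages:
        i, o = indeg[p], len(links.get(p, ()))
        sigma = (i > o) - (i < o)                 # 1, -1 or 0
        diff = abs(i - o)
        ca[p] = i / (i + o) * diff ** sigma if i + o else 0.0
        ch[p] = o / (i + o) * diff ** (-sigma) if i + o else 0.0
    return ca, ch


def weighted_hits(links, tol=1e-10, max_iter=1000):
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    ca, ch = degree_weights(links, pages)
    hub = {p: 1.0 / len(pages) for p in pages}    # uniform starting vector
    auth = dict.fromkeys(pages, 0.0)
    for _ in range(max_iter):
        # a(k+1) = h(k) C_h L: hub scores are scaled by ch before being passed on.
        auth = dict.fromkeys(pages, 0.0)
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src] * ch[src]
        # h(k+1) = a(k+1) C_a L^T: authority scores are scaled by ca.
        new_hub = {p: sum(auth[q] * ca[q] for q in links.get(p, ())) for p in pages}
        total = sum(new_hub.values()) or 1.0      # guard for a graph with no links
        new_hub = {p: v / total for p, v in new_hub.items()}
        residual = sum(abs(new_hub[p] - hub[p]) for p in pages)
        hub = new_hub
        if residual < tol:
            break
    total = sum(auth.values()) or 1.0
    return {p: v / total for p, v in auth.items()}, hub


# Toy graph (made-up pages): two portal pages and two target pages.
toy = {"h1": {"x", "y"}, "h2": {"x"}, "x": set(), "y": set()}
auth, hub = weighted_hits(toy)
print(auth)
print(hub)
```

The only difference from plain HITS is the two multiplications by `ch[src]` and `ca[q]`, which is why the extra cost per iteration is one multiplication per page for each of the two vectors.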

Algorithms 1, 2, and 3 are used to calculate QI-HITS, the proposed
algorithm, and PageRank respectively, where $\|*\|_1$ denotes the
1-norm of *, $\epsilon$ denotes the desired residual level, $p^T$
denotes the $1 \times N$ PageRank vector,
$D_o = \mathrm{diag}(\mathrm{outdeg}_1, \mathrm{outdeg}_2, \ldots, \mathrm{outdeg}_N)$
denotes the diagonal outdegree matrix, $d$ denotes the $N \times 1$
dangling vector whose $n$-th ($n = 1, 2, \ldots, N$) entry is 1 if $n$
is a dangling page and 0 otherwise, and $0 < \alpha < 1$ denotes a
scalar that controls the proportion of time a random surfer follows
the hyperlinks as opposed to teleporting. Algorithm 3 is adopted from
works by Langville. Detailed discussions can be found in [3, 18].
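
For comparison, the PageRank update of Algorithm 3 can be sketched in the same dictionary-based style. This is a hedged illustration only: the value 0.85 for the scalar $\alpha$ is a common choice but is not stated in the text, rows of dangling pages are simply skipped in the $D_o^{-1} L$ product, and the names `pagerank`, `tol`, and `max_iter` are hypothetical.

```python
def pagerank(links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method for PageRank on {page: set of pages it points to}."""
    pages = sorted(set(links) | {q for ts in links.values() for q in ts})
    n = len(pages)
    p = dict.fromkeys(pages, 1.0 / n)             # uniform starting vector
    for _ in range(max_iter):
        # Score mass currently sitting on dangling pages (the product p d).
        dangling = sum(p[u] for u in pages if not links.get(u))
        # Every page receives the same share (alpha * p d + 1 - alpha) / N.
        new_p = dict.fromkeys(pages, (alpha * dangling + 1.0 - alpha) / n)
        # alpha * p * Do^-1 * L: a nondangling page splits its score over its outlinks.
        for u in pages:
            targets = links.get(u, ())
            for v in targets:
                new_p[v] += alpha * p[u] / len(targets)
        residual = sum(abs(new_p[u] - p[u]) for u in pages)
        p = new_p
        if residual < tol:
            break
    return p


# Toy graph (made-up pages); "c" is a dangling page.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
scores = pagerank(toy)
print(scores)
```

No normalization step is needed here because each update preserves the sum of the scores, which matches the remark in section 3.2 that PageRank avoids the normalization cost.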

3.2. Operational Costs and Memory Requirements 

In QI-HITS, there are two dense vector x sparse ma- 
trix operations for each iteration step. Because L contains 
Algorithm 1: QI-HITS

Initialize $h^{(k=0)T} = e^T/N$
While $\delta > \epsilon$, do:
    $a^{(k+1)T} = h^{(k)T} L$
    $h^{(k+1)T} = a^{(k+1)T} L^T$
    Normalize $h^{(k+1)T}$
    $\delta = \|h^{(k+1)T} - h^{(k)T}\|_1$
    $k = k + 1$
Normalize $a^T$
Return $a^T$ and $h^T$



Algorithm 2: Prop. Alg.

Initialize $h^{(k=0)T} = e^T/N$
Calculate $C_a$ and $C_h$
While $\delta > \epsilon$, do:
    $a^{(k+1)T} = h^{(k)T} C_h L$
    $h^{(k+1)T} = a^{(k+1)T} C_a L^T$
    Normalize $h^{(k+1)T}$
    $\delta = \|h^{(k+1)T} - h^{(k)T}\|_1$
    $k = k + 1$
Normalize $a^T$
Return $a^T$ and $h^T$



Algorithm 3: PageRank

Initialize $p^{(k=0)T} = e^T/N$
While $\delta > \epsilon$, do:
    $p^{(k+1)T} = \alpha p^{(k)T} D_o^{-1} L + (\alpha p^{(k)T} d + 1 - \alpha) e^T/N$
    $\delta = \|p^{(k+1)T} - p^{(k)T}\|_1$
    $k = k + 1$
Return $p^T$



only either 1 or 0, the cost of each multiplication is nnz(L)
additions, where nnz(L) denotes the number of nonzeros of L. And the
normalization step needs N multiplications. Thus, QI-HITS needs N
multiplications and 2nnz(L) additions per iteration.

In the proposed algorithm, a and h are multiplied by $C_a$ and $C_h$
respectively, so there are 2N additional multiplications. Thus, the
proposed algorithm needs 3N multiplications and 2nnz(L) additions per
iteration.

In PageRank, p is multiplied by $D_o^{-1}$, which needs N
multiplications. Then the result is multiplied by L, which needs
nnz(L) additions. Further, there are additional adjustments
(stochasticity and primitivity) to ensure the convergence of the
result, which need |ND| multiplications and (N + |ND|) additions,
where |ND| denotes the number of nondangling pages. Because there is
no need to do normalization in PageRank, the costs are N + |ND|
multiplications and (nnz(L) + N + |ND|) additions per iteration.
Table 2 summarizes the costs.
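
The per-iteration counts derived above can be tabulated for a concrete graph size with a few lines of code. The sketch below only restates the arithmetic of this section; the function name `costs_per_iteration` and the sample sizes are hypothetical and do not correspond to any dataset in the paper.

```python
def costs_per_iteration(n, nnz, n_nd):
    """Multiplications and additions per iteration, as counted in the text.

    n: number of pages N, nnz: number of nonzeros of L,
    n_nd: the quantity |ND| used in the text.
    """
    return {
        "QI-HITS": {"mult": n, "add": 2 * nnz},
        "Prop. Alg.": {"mult": 3 * n, "add": 2 * nnz},
        "PageRank": {"mult": n + n_nd, "add": nnz + n + n_nd},
    }


# Hypothetical graph: 1000 pages, 8000 links, |ND| = 200.
table = costs_per_iteration(1000, 8000, 200)
for name, cost in table.items():
    print(name, cost["mult"], cost["add"])
```

The proposed algorithm differs from QI-HITS only by 2N multiplications per iteration, so any reduction in the number of iterations quickly outweighs the extra work when nnz(L) is much larger than N.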

The memory requirement for L is nnz(L) booleans;
$h^{(k)T}$, $a^{(k+1)T}$, and $h^{(k+1)T}$ need N doubles each;
Table 2. Operational costs for original datasets.

Algorithm     Mult.       Addition
QI-HITS       N           2nnz(L)
Prop. Alg.    3N          2nnz(L)
PageRank      N + |ND|    nnz(L) + N + |ND|



Table 3. Memory requirements for original datasets.

Algorithm     Memory
QI-HITS       nnz(L) bools and 3N doubles
Prop. Alg.    nnz(L) bools and 5N doubles
PageRank      nnz(L) + |ND| bools, N integers, and 2N doubles



Table 4. Operational costs for back button datasets.

Algorithm     Mult.       Addition
QI-HITS       N           2nnz(L*)
Prop. Alg.    3N          2nnz(L*)
PageRank      N           nnz(L*) + N



Table 5. Memory requirements for back button datasets.

Algorithm     Memory
QI-HITS       nnz(L*) bools and 3N doubles
Prop. Alg.    nnz(L*) bools and 5N doubles
PageRank      nnz(L*) bools, N integers, and 2N doubles



$D_o$ is N integers; and d is |ND| booleans. Table 3 summarizes the
required memory.

3.3. The Back Button Model 

The dangling pages can cause computational issues for both PageRank
and HITS. In the PageRank case, the score of a page is originally
defined as the proportion of time the random surfer spends on the page
after following the hyperlink structure of the web infinitely [2]. If
the web graph contains dangling pages, the random surfer will not be
able to traverse all pages because it will be trapped as soon as it
encounters a dangling page. Consequently, not only can the scores be
defined only for the pages that have been visited, but, due to the
finite time of the observation, these scores will also not reflect the
real values of the pages.

From a linear algebra perspective, the dangling pages make the web
graph not strongly connected; the adjacency matrix induced from the
web graph is reducible. By the Perron theorem for nonnegative matrices
[3,19], the dominant eigenvector of a nonnegative reducible matrix
exists but is not necessarily positive and unique. Thus, there is no
guarantee that a unique and positive PageRank vector exists for the
original PageRank problem, i.e., finding the dominant eigenvector of
$(D_o^{-1} L)^T$ [31].

The same holds for HITS: both the authority matrix $L^T L$
($a^{(k+1)T} = a^{(k)T} L^T L$) and the hub matrix $L L^T$
($h^{(k+1)T} = h^{(k)T} L L^T$) are nonnegative. Thus, by the Perron
theorem for nonnegative matrices, $a^T$ and $h^T$ exist but there is
no guarantee of uniqueness [32].

In addition to the non-uniqueness, there is also a problem related to
the final distributions, as in the PageRank case. The dangling pages
receive scores from the pages that point to them but do not pass their
scores on (because they have no outlinks), so they will always become
authoritative and have no hub scores. Consequently, the authority and
hub distributions become more skewed as the number of dangling pages
increases, and the average distances between the initial distributions
(usually uniform) and the final stationary distributions are greater
in a web graph with many dangling pages than in a web graph without
dangling pages. Thus, the convergence rates tend to become slower as
the number of dangling pages increases.

Our experiments confirm this effect. As shown in Fig. 2, the PageRank
convergence rates are almost the same as or better than those of HITS,
which does not agree with previous experiments, where the researchers
usually removed the dangling pages from the datasets [1, 16, 17, 20,
21]. In the datasets without dangling pages (Fig. 3), the results
agree with the previous experiments; HITS usually converges faster
than PageRank.

To deal with the dangling pages, instead of removing them, we prefer
to use the back button model [22-24]. This is because, first, web
graph datasets usually have high percentages of dangling pages, so a
great portion of useful data will be lost if they are removed. Second,
some of the dangling pages are important pages [25], so removing them
can bias the results. And third, most users go back to the previous
page when encountering a dangling page, so this is a natural way of
modeling the web graph. Mathematically, the back button model rewrites
$L$ into $L^* = L + M$, where $M$ is an $N \times N$ matrix whose row
$i$ is equal to column $i$ of $L$ if $i$ is a dangling page, and
$0_{1 \times N}$ otherwise.
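
The construction of $L^*$ can be sketched on an adjacency-set representation. This is a minimal illustration under the reading that every dangling page receives a link back to each page that points to it; the function name `back_button` and the toy graph are hypothetical.

```python
def back_button(links):
    """Return the links of L* = L + M for a graph {page: set of pages it points to}."""
    pages = set(links) | {q for ts in links.values() for q in ts}
    inlinks = {p: set() for p in pages}
    for src, targets in links.items():
        for dst in targets:
            inlinks[dst].add(src)
    result = {}
    for p in pages:
        out = set(links.get(p, ()))
        # A dangling page gets a link back to every page that points to it;
        # all other pages keep their outlinks unchanged (zero row in M).
        result[p] = out if out else set(inlinks[p])
    return result


# Toy graph (made-up pages); "c" is dangling.
web = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
star = back_button(web)
print(star["c"])
```

Note that a page with neither inlinks nor outlinks stays dangling under this construction, since there is no previous page to go back to.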

Table 6 shows the average fractions of authoritative and hubby pages
in the original and the back button datasets, where $f_i$ denotes the
average fraction of indegree and $f_o$ denotes the average fraction of
outdegree. The original datasets on average have more than 93%
authoritative pages ($f_i > 0.6$), and this percentage is almost
unchanged for very authoritative pages ($f_i > 0.9$). As shown in
Table 7, the average percentage of dangling pages is 92.9%, so almost
all authoritative pages are dangling pages. Conversely, the number of
hubby pages is only about 6% and is also almost unchanged for very
hubby pages.

When the back button model is applied, the percentages of
authoritative pages drop significantly and become comparable to the
percentages of hubby pages. Because this model turns the dangling
pages into nondangling ones, the remaining pages are the real
authorities: pages with more inlinks than outlinks, not pages that
have only inlinks.

The costs and memory requirement of QI-HITS, the 
proposed algorithm, and PageRank for back button 
Table 6. Average fractions of authoritative and hubby pages.

               Authoritative page                        Hubby page
Data           f_i>0.6  f_i>0.7  f_i>0.8  f_i>0.9        f_o>0.6  f_o>0.7  f_o>0.8  f_o>0.9
Original       0.9334   0.9320   0.9310   0.9303         0.0661   0.0644   0.0613   0.0556
Back button    0.0189   0.0041   0.0022   0.0013         0.0503   0.0373   0.0253   0.0102


datasets are shown in Tables 4 and 5.

3.4. Convergence Analysis 

To analyze the convergence property of the proposed algorithm, eq. 5
will be rewritten as a single vector × matrix representation instead
of two. Let $X = C_a L^T C_h L$; the authority vector of the proposed
algorithm can be rewritten as:

$a^{(k+1)T} = a^{(k)T} X$ (6)

and after the power method applied to the above equation has
converged, the hub vector can be recovered by calculating
$h^T = a^T C_a L^T$. Consequently, the problem of finding the
authority and hub vectors is reduced to calculating only the dominant
eigenvector of $X^T$ (note that HITS can also be rewritten in this
style [29]).

Because $X^T$ is nonnegative, it always has a nonnegative dominant
eigenvalue $\lambda_1$ such that the moduli of all other eigenvalues
$\{\lambda_2, \ldots, \lambda_k\}$ do not exceed $\lambda_1$, and a
dominant eigenvector corresponding to $\lambda_1$ can be chosen so
that every entry is nonnegative (see theorem 3.6 in [19]). Thus, the
existence of the authority vector of the proposed algorithm is
guaranteed. But depending on the initialization, it may not be unique
since $\lambda_1$ may be repeated [3,19].

To guarantee the uniqueness of the solution of the proposed algorithm,
the matrix $X$ must be modified into a positive matrix. The positive
version of $X$ can be written as:
$\hat{X} = \zeta X + (1/N)(1 - \zeta) e e^T$, where $0 < \zeta < 1$ is
a constant that should be set near to 1 to preserve the hyperlink
structure information. And the proposed algorithm can be rewritten as:

$a^{(k+1)T} = a^{(k)T} \hat{X}$ (7)

Because $\hat{X}$ is not stochastic, $a^T$ must be normalized at each
iteration step.

By the Perron theorem for positive matrices [3, 19], a unique and
positive principal eigenvector of $\hat{X}^T$ (the authority vector of
the proposed algorithm) is guaranteed to exist. And by ensuring that
the starting vector is not in the range of
$(\hat{X}^T - \lambda_1 I)$, the power method applied to eq. 7 is
guaranteed to converge to this vector [3]. In particular, all positive
vectors satisfy the requirement for the starting vector [26].
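
The effect of the positive modification can be illustrated with a small dense example. The sketch below applies the power method to $\zeta X + (1/N)(1 - \zeta) e e^T$ without ever forming the rank-one term explicitly. The value 0.95 for the constant is only an assumption in the spirit of "near to 1", the example matrix is made up, and the function name `power_method_positive` is hypothetical.

```python
def power_method_positive(x, zeta=0.95, tol=1e-12, max_iter=10000):
    """Left dominant eigenvector of zeta * X + (1 - zeta) / N * e e^T.

    x: square nonnegative matrix given as a list of rows.
    """
    n = len(x)
    a = [1.0 / n] * n                        # positive starting vector e^T / N
    for _ in range(max_iter):
        s = sum(a)
        # (a X_hat)_j = zeta * sum_i a_i X_ij + (1 - zeta) / N * sum_i a_i
        new_a = [
            zeta * sum(a[i] * x[i][j] for i in range(n)) + (1.0 - zeta) * s / n
            for j in range(n)
        ]
        total = sum(new_a)
        new_a = [v / total for v in new_a]   # the modified matrix is not stochastic
        residual = sum(abs(u - v) for u, v in zip(new_a, a))
        a = new_a
        if residual < tol:
            break
    return a


# Made-up nonnegative matrix whose dominant eigenvalue 2 is repeated,
# so its dominant eigenvector is not unique before the modification.
x = [
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
a = power_method_positive(x)
print(a)
```

For the unmodified example matrix, any combination of the first two coordinate vectors is a dominant eigenvector; after the modification the iteration settles on a single vector whose entries are all positive, which also avoids assigning zero scores to pages.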

Note that in addition to guaranteeing the uniqueness, 
this modification also tackles the second less obvious 
problem associated with the proposed algorithm (and also 
HITS); producing ranking vectors that inappropriately as- 
sign zero scores to some pages [19]. 

In practice, however, as in the HITS case, the proposed algorithm can
usually be used without this modification because the scores from link
structure ranking algorithms will be combined with other scores like
content and hypertext scores, making the final scores less sensitive
to the ranking vectors. In fact, in our experiments, HITS and the
proposed algorithm do converge to unique nonnegative vectors for all
datasets (including the back button datasets) with $e^T/N$ as the
starting vector.

4. Experimental Results 

In this section the performance of the proposed algo- 
rithm is evaluated by comparing its convergence rates and 
processing times to reach the same corresponding residual 
level with the results of QI-HITS and PageRank. The suit- 
ability of the proposed algorithm in approximating HITS 
is confirmed in section 4.4. And the examples of the top 
pages returned by the algorithms are given in section 4.5. 

The experiments are conducted on a notebook with a 1.86 GHz Intel
processor and 2 GB of RAM. The code is written in Python and makes
extensive use of a database to store lists of the adjacency matrix,
score vectors, and other related data on the hard disk. It is
open-sourced as part of our work in developing a simple web search
engine for research purposes [27].

4.1. The Datasets 

There are 8 datasets used in the experiments, consisting of around 10
thousand to 225 thousand pages with average degrees from around 4 to
47. Except for Wikipedia [28], all datasets were crawled using our
crawling system [27]. All datasets but Britannica have an average
degree typical of web graphs, around 4 to 15 [3, 8]. However, the
percentages of dangling pages are considerably higher here than in a
typical dataset (around 70% to 85%) due to the high number of
downloaded but unexplored pages. Table 7 summarizes the datasets,
where %DP denotes the percentage of dangling pages, and AD denotes the
average degree.

4.2. Convergence Rates 

Fig. 2(a)-2(h) and 3(a)-3(h) show the convergence rates for the
original and the back button datasets respectively. The horizontal
axis is the number of iterations and the vertical axis is the
residual. In the original datasets, the proposed algorithm converges
faster than HITS (except for the yahoo dataset, where the percentage
of dangling pages is too high), but generally still cannot beat
PageRank. In the back button model, where the dangling pages are
forced to have outlinks, the proposed algorithm gives
Table 7. Datasets summary.

Data           Crawled   Pages         Links         %DP           AD
britannica     09/2008   (illegible)   (illegible)   (illegible)   47.1
jobs           12/2008   16056         187957        92.0          11.7
opera          12/2008   49749         437748        95.4          8.8
python         09/2008   57328         449529        93.5          7.8
scholarpedia   06/2008   74243         1077781       86.5          14.5
stanford       12/2008   225441        2196441       96.7          9.7
wikipedia      09/2006   10431         46152         96.1          4.4
yahoo          12/2008   34054         161700        98.0          4.7
Average                  61050         693983        92.9          13.6



Table 8. Average similarity measures. 



Data 


Authority 


Hub 


Cosine Spearman 


Cosine Spearman 


Original 
Back button 


0.859 0.810 
0.912 0.794 


0.976 0.999 
0.945 0.861 



very promising results, faster than both HITS and Page- 
Rank for all datasets. Further, in the back button model 
generally HITS converges faster than PageRank, which 
agrees with the previous works [1, 16, 17, 20, 21]. 
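
The quantities plotted in the convergence figures can be illustrated on a toy graph. The sketch below rests on two assumptions that are ours for illustration only: the back button model is taken to mean that every dangling page links back to the pages that point to it, and the residual is taken to be the 1-norm of the difference between successive normalized authority vectors. The function names and the toy graph are hypothetical, and plain HITS is iterated rather than the proposed algorithm.

```python
def back_button(out_links):
    """Copy the graph and give each dangling page links back to its in-neighbours."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    graph = {p: list(out_links.get(p, [])) for p in pages}
    in_links = {p: [] for p in pages}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    for p in pages:
        if not graph[p]:  # dangling page: no outgoing links
            graph[p] = list(in_links[p])
    return graph


def hits_residuals(graph, iterations=50):
    """Run plain HITS and record the residual of the authority vector per iteration."""
    pages = sorted(graph)
    auth = {p: 1.0 / len(pages) for p in pages}
    residuals = []
    for _ in range(iterations):
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # Authority score: sum of the hub scores of the pages pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in graph[p]:
                new_auth[q] += hub[p]
        total = sum(new_auth.values())
        if total == 0:  # graph without links: nothing to iterate
            break
        new_auth = {p: v / total for p, v in new_auth.items()}
        residuals.append(sum(abs(new_auth[p] - auth[p]) for p in pages))
        auth = new_auth
    return auth, residuals


toy_graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["a"]}
bb_graph = back_button(toy_graph)
auth, residuals = hits_residuals(bb_graph)
```

Plotting `residuals` against the iteration index gives a curve of the same kind as those in the convergence figures, although the toy graph is far too small to say anything about the relative speed of the algorithms.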

4.3. Processing times 

Fig. 2(i) and 3(i) show the processing times (in seconds)
needed to reach the same corresponding residual level for the
original and the back button datasets respectively. On the
original datasets, PageRank is on average the fastest and
HITS the slowest to converge. In the back button model, the
proposed algorithm is the fastest in five out of eight cases,
PageRank is the slowest in six out of eight cases, and HITS
gives moderate performance. The results for the back button
model also agree with the previous works [1, 16, 17, 20, 21],
where HITS needs less processing time than PageRank.

4.4. Similarity measures 

The similarity measures between the authority and hub
vectors of QI-HITS and the proposed algorithm are shown
in Table 8. The purpose of the measurements is to confirm
the suitability of the proposed algorithm for approximating
the results of QI-HITS. As shown there, the proposed
algorithm gives good approximations to the authority
vectors, and very good ones to the hub vectors. The measures
are best for the hub vectors of the original datasets. Because
the Spearman correlation there is about 0.999, it can be
inferred that the proposed algorithm returns (almost) exactly
the same ordering as QI-HITS. This is not surprising because
in the original datasets more than 90% of the pages are
dangling pages, which have no hub scores.
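
The two measures can be computed as in the following minimal sketch. It uses the textbook definitions: cosine similarity of the score vectors, and Spearman correlation as the Pearson correlation of the ranks, with tied scores given the mean of their ranks. Whether our measurements handled ties in exactly this way is an assumption of the sketch, and the two example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(u, v):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ru, rv = average_ranks(u), average_ranks(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    var_u = sum((a - mu) ** 2 for a in ru)
    var_v = sum((b - mv) ** 2 for b in rv)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical hub vectors: different scores, identical ordering.
qi_hits_hub = [0.40, 0.30, 0.20, 0.10, 0.0]
proposed_hub = [0.50, 0.25, 0.15, 0.10, 0.0]
cos_sim = cosine(qi_hits_hub, proposed_hub)
rank_corr = spearman(qi_hits_hub, proposed_hub)
```

In this example the scores differ, so the cosine is below 1, while the ordering is identical, so the Spearman correlation is 1. Pages with the same score, such as dangling pages with zero hub score, share a rank in both vectors.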

4.5. Top pages 

Tables 9 and 10 give examples of the results returned by
the authority vectors of the three algorithms for the
wikipedia dataset, without a query and with the query
"programming" respectively. Note that for brevity only file
names are displayed. To get the full URLs, each name has to
be prefixed with "http://en.wikipedia.org/wiki/".

5. Conclusions and Future Research

The proposed algorithm, which makes use of the definitions
of authority and hub, can be used to accelerate the
HITS algorithm. While on the original datasets it converges
faster only than HITS, on the back button datasets it
converges faster than both PageRank and HITS. Further, there
are generally also some improvements in the processing times,
especially for the back button datasets.

The non-uniqueness problem due to the reducibility of
the authority matrix X can be eliminated by forcing the
matrix to be positive. This modification not only guarantees
uniqueness, but also tackles a second, less obvious problem:
producing ranking vectors that inappropriately assign zero
scores to some pages.

Based on the similarity measurements, it can be concluded
that the vectors produced by the proposed algorithm can be
used to approximate the QI-HITS vectors. If the QI-HITS
vectors themselves are desired, the QI-HITS algorithm can be
run with these vectors as the starting vectors for the last
few iteration steps. Because QI-HITS involves calculating
the stationary vectors of the enormous adjacency matrix of
the web graph, saving even a few iteration steps is worth
considerable resources, since the calculations take days to
finish.
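
The effect of such a warm start can be sketched on a toy problem. The example below is a hedged illustration only: a plain normalized power iteration stands in for QI-HITS, the 3 x 3 matrix is a hypothetical stand-in for an authority matrix, and the warm start is a hand-rounded approximation of its dominant vector rather than an output of the proposed algorithm. The function name and the tolerance are likewise assumptions.

```python
def iterations_to_converge(matrix, start, tol=1e-10, max_iter=1000):
    """Normalized power iteration; return the vector and the number of steps used."""
    total = sum(start)
    x = [v / total for v in start]
    for k in range(1, max_iter + 1):
        y = [sum(row[j] * x[j] for j in range(len(x))) for row in matrix]
        total = sum(y)
        y = [v / total for v in y]
        # Residual: 1-norm of the change between successive vectors.
        residual = sum(abs(a - b) for a, b in zip(x, y))
        x = y
        if residual < tol:
            return x, k
    return x, max_iter


# Hypothetical symmetric matrix standing in for an authority matrix.
toy_matrix = [[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]]

# Cold start: uniform vector. Warm start: rough approximation of the answer.
cold_vec, cold_iters = iterations_to_converge(toy_matrix, [1.0, 1.0, 1.0])
warm_vec, warm_iters = iterations_to_converge(toy_matrix, [0.293, 0.414, 0.293])
print(cold_iters, warm_iters)
```

Both runs reach the same vector, and on this toy input the warm start needs fewer iterations. How many iterations are saved on a real web graph depends on how close the approximate vectors are to the QI-HITS vectors.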

There are some interesting directions for future research
related to this work. First, as stated in Sections 1 and 2,
research on accelerating the QI-HITS computations is scarce.
The remarkable similarity between the PageRank and QI-HITS
formulations [29] implies that QI-HITS can be accelerated by
utilizing the methods discussed in Section 2 in conjunction
with the proposed algorithm.

Second, as shown in eq. 6, the HITS algorithm can be
accelerated by introducing Ca and Ch into the original
authority matrix L^T L. While we calculate the entries
of Ca and Ch based on the HITS definitions of authority
and hub scores, other schemes can probably also be used [33],
such as dynamically updating the entries by using the
differences between the vectors of the current and previous
iterations, or using previously calculated ranking vectors
as the entries.
Fig. 2. Convergence rates and processing times for original datasets.



Table 9. Top 10 results without query for wikipedia dataset.

No.  PageRank                     HITS                         Prop. Alg.
1    Main_Page                    Main_Page                    Main_Page
2    Programming_language         Programming_language         Programming_language
3    Computer_language            C_programming_language       2006
4    C_programming_language       Java_programming_language    C_programming_language
5    Java_programming_language    C%2B%2B                      Operating_system
6    Object-oriented_programming  Operating_system             Microsoft_Windows
7    Compiler                     Microsoft_Windows            Unix
8    C%2B%2B                      Object-oriented_programming  Linux
9    Operating_system             Unix                         2005
10   Microsoft_Windows            Programming_paradigm         Java_programming_language
Fig. 3. Convergence rates and processing times for back button datasets.



Table 10. Top 10 results with query "programming" for wikipedia dataset.

No.  PageRank                                   HITS                                       Prop. Alg.
1    Programming_language                       Programming_language                       Programming_language
2    Categorical_list_of_programming_languages  Categorical_list_of_programming_languages  Categorical_list_of_programming_languages
3    Functional_programming                     C_programming_language                     C_programming_language
4    Object-oriented_programming                Functional_programming                     Functional_programming
5    C_programming_language                     Object-oriented_programming                Object-oriented_programming
6    Generic_programming                        Programming_paradigm                       Java_programming_language
7    Programming_paradigm                       Java_programming_language                  Programming_paradigm
8    Java_programming_language                  Generic_programming                        [illegible]
9    Lisp_programming_language                  Lisp_programming_language                  Lisp_programming_language
10   Logic_programming                          Ada_programming_language                   Ada_programming_language